Let's imagine that we work at a startup that sells food products, and our task is to investigate user behavior in the company's app.
The conclusions drawn from this analysis will help improve the company's conversion rate by interpreting user behavior and clarifying the results of statistical tests.
We will use the following libraries:
import pandas as pd
import numpy as np
import datetime
from datetime import timedelta
from datetime import datetime
import plotly.express as px
import plotly.graph_objects as go
import math as mth
from scipy import stats as st
import re
import itertools
from plotly.offline import iplot, init_notebook_mode
import matplotlib.pyplot as plt
import matplotlib as mpl
import sys
import warnings
if not sys.warnoptions:
    warnings.simplefilter("ignore")
import seaborn as sns
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)
plt.style.use('fivethirtyeight')
Let's set some parameters for plotting:
mpl.rcParams['lines.linewidth'] = 2
mpl.rcParams["figure.figsize"] = [8, 6]
mpl.rcParams.update({"axes.grid": True, "grid.color": "grey"})
mpl.rcParams['image.cmap'] = 'gray'
mpl.rcParams['figure.dpi'] = 80
mpl.rcParams['savefig.dpi'] = 100
mpl.rcParams['font.size'] = 12
mpl.rcParams['legend.fontsize'] = 'large'
mpl.rcParams['figure.titlesize'] = 'medium'
try:
    # practicum path
    logs_exp = pd.read_csv('/datasets/logs_exp_us.csv', sep='\t',
                           dtype={'EventName': 'category', 'ExpId': 'category'})
except FileNotFoundError:
    try:
        # local path
        logs_exp = pd.read_csv('./datasets/logs_exp_us.csv', sep='\t',
                               dtype={'EventName': 'category', 'ExpId': 'category'})
    except FileNotFoundError:
        try:
            # loading path
            logs_exp = pd.read_csv('https://code.s3.yandex.net//datasets/logs_exp_us.csv', sep='\t',
                                   dtype={'EventName': 'category', 'ExpId': 'category'})
        except FileNotFoundError:
            print('Oops, the dataset was not found.')
        except pd.errors.EmptyDataError:
            print('Oops, the dataset is empty.')
Let's downcast our data so it doesn't take up too much space.
logs_exp['DeviceIDHash'] = pd.to_numeric(logs_exp['DeviceIDHash'], downcast='integer')
logs_exp['EventTimestamp'] = pd.to_numeric(logs_exp['EventTimestamp'], downcast='integer')
logs_exp.info(memory_usage='deep')
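To illustrate what downcasting buys us, here is a minimal sketch on toy data (the values and sizes are made up for the example): `pd.to_numeric(..., downcast='integer')` picks the smallest integer dtype that can hold the values.

```python
import numpy as np
import pandas as pd

# Toy column: the values fit comfortably into int16, but pandas stores int64 by default
s = pd.Series(np.arange(1000), dtype='int64')
small = pd.to_numeric(s, downcast='integer')  # smallest integer dtype that fits

print(s.memory_usage(deep=True), '->', small.memory_usage(deep=True))
```

On the real `DeviceIDHash` column the hashes are large, so the saving is smaller, but the same mechanism applies.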
We successfully opened the dataset. It contains 244,126 rows: 2 category columns and 2 integer columns. Let's see how we can preprocess it.
logs_exp.columns = ['event_name', 'user_id', 'timestamp', 'experiment_id']
logs_exp.user_id.nunique()
logs_exp.describe(include='all')
round(logs_exp.experiment_id.value_counts(normalize=True) * 100,2)
for i in logs_exp[logs_exp.duplicated()].columns:
print(i, ':', logs_exp[logs_exp.duplicated()][i].nunique())
print(f'percentage of duplicates is {round(logs_exp.duplicated().sum() / logs_exp.shape[0] * 100, 2)}%')
logs_exp = logs_exp.drop_duplicates()
logs_exp['timestamp'] = logs_exp['timestamp'].apply(datetime.fromtimestamp)
logs_exp['date'] = logs_exp['timestamp'].dt.floor('D')  # truncate to day level
We found a really small number of duplicated rows; in the real world we should report this, because it means something went wrong with the obtained data. We found no missing values. We created the 'timestamp' and 'date' columns, which will help us later on, and renamed the columns to the officially accepted naming format. The proportions across experiment groups also look equal.
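The "no missing values" claim can be backed by a quick `isna()` check; here is a sketch on a toy frame with the same column names (the event names and IDs are placeholders, not real values from the dataset):

```python
import pandas as pd

# Toy frame mimicking the renamed log columns
logs = pd.DataFrame({
    'event_name': ['ScreenA', 'ScreenB'],
    'user_id': [111, 222],
    'timestamp': pd.to_datetime(['2019-08-01 10:00', '2019-08-01 11:00']),
    'experiment_id': ['246', '247'],
})

missing = logs.isna().sum()  # per-column count of missing values
print(missing)
```

Running the same check on `logs_exp` should print zeros for every column.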
print(f'we have {logs_exp.event_name.nunique()} unique events and {logs_exp.shape[0]} events in general in the logs dataset')
logs_exp.user_id.nunique()
round(logs_exp.groupby('user_id')['event_name'].count().mean(),2)
print(f"The research period is from {logs_exp.date.min()} to {logs_exp.date.max()} covering {(logs_exp.date.max() - logs_exp.date.min())/ np.timedelta64(1, 'D') + 1} days")
fig = px.histogram(logs_exp, x="timestamp", title='amount of events distribution')
fig.show()
We can see very clearly that the August data has its own pattern, but I want to make sure that August 1st also belongs to that pattern. I also want to check the distribution of events by hour, because I believe the dips happen due to the lower visitor flow at night and the peaks due to daytime visitors. As for the low values before August, the reason was possibly technical problems.
logs_exp['hour'] = logs_exp.timestamp.dt.floor('H')  # floor, so an event at 10:40 lands in the 10:00 bucket
logs_exp['only_hour'] = logs_exp['hour'].dt.hour
fig = px.histogram(logs_exp, x="only_hour", title='total amount of events per hour')
fig.show()
As I expected, the night hours have the lowest number of events. But I want to get the mean number of events for the hours when the system received data correctly, not the whole sum of events.
min_amount = 81  # empirically chosen threshold: hours with fewer events are treated as under-logged
normal_hours = logs_exp.groupby(['date','only_hour'])['event_name'].count().unstack()
event_per_hour = pd.DataFrame(normal_hours[normal_hours > min_amount].mean(axis=0))
event_per_hour.columns = ['amount']
fig = px.histogram(event_per_hour, x=event_per_hour.index, y='amount', nbins=24,
labels={'amount':'average events per hour'}, # can specify one label per df column
opacity=0.8,
color_discrete_sequence=['indianred'], title='mean amount of events per hour') # color of histogram bars
fig.show()
Now I have the average number of events for every hour and can compare it to the received values.
per_day_per_hour = logs_exp.groupby(['date','only_hour'])['event_name'].count().reset_index(drop=False)
per_day_per_hour = per_day_per_hour.merge(
event_per_hour, how='left', left_on='only_hour', right_on=event_per_hour.index)
per_day_per_hour.columns = ['date', 'hour', 'events_num', 'mean_events_num']
per_day_per_hour['diff'] = per_day_per_hour['mean_events_num'] - per_day_per_hour['events_num']
per_day_per_hour
per_day_per_hour['when'] = per_day_per_hour['date'].dt.day.astype('str') + ', ' + per_day_per_hour['hour'].astype('str')
Let's plot a line that shows how the received values varied from the average.
fig = px.line(per_day_per_hour, x='when', y='diff', title='Difference between regular and received event numbers')
fig.show()
This dataset contains 5 unique events and 243,713 event records generated by 7,551 users, which is roughly 32 events per user. The data covers two weeks, but not all days contain properly received information, possibly due to technical reasons. I calculated the average number of events for every hour to compare with the received data. The graph shows that starting from 2019-08-01 the data looks 'normal', so I choose this date as the first point of properly distributed data. When the July data was close to the 'average', it was due to regularly low event numbers, as the other plots confirm. So the data really represents just the period from 2019-08-01 to 2019-08-08.
good_date = '2019-08-01'
filtered_logs = logs_exp[logs_exp['date'] >= good_date]
bad_data = logs_exp[logs_exp['date'] < good_date]
print(f'After getting rid of bad data (received during half of the whole research period), we lost only {round(bad_data.shape[0] / logs_exp.shape[0] * 100, 2)}% of the data')
print(f'We also lost {bad_data.user_id.nunique()} users')
I also want to see the proportions of the events in the removed data.
bad_data.event_name.value_counts(normalize=True) * 100
users_in_group = pd.DataFrame(filtered_logs.groupby('experiment_id')['user_id'].nunique())
fig = px.pie(users_in_group, values='user_id', names=users_in_group.index, title='Proportions of experiment groups')
fig.show()
filtered_logs.groupby('experiment_id')['user_id'].nunique()
events_count = pd.DataFrame(filtered_logs.experiment_id.value_counts())
fig = px.histogram(events_count, x=events_count.index, y = 'experiment_id',
title='Number of events per group')
fig.show()
After getting rid of bad data (received during half of the whole research period), we lost only 0.82% of the data, which included visits of 1,319 users. I also checked the proportions of the event types: their order remained the same. Finally, I checked how many users are left in the filtered data; they are still evenly distributed across the groups.
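The "order remained the same" check can be sketched as a side-by-side comparison of normalized event counts. This is toy data with placeholder event names; the real check would concatenate `logs_exp.event_name.value_counts(normalize=True)` and the same for `filtered_logs`:

```python
import pandas as pd

# Toy stand-ins for the full and the filtered event logs
full = pd.Series(['A', 'A', 'A', 'B', 'B', 'C'], name='event_name')
filtered = pd.Series(['A', 'A', 'A', 'A', 'B', 'B', 'C'], name='event_name')

comparison = pd.concat(
    [full.value_counts(normalize=True), filtered.value_counts(normalize=True)],
    axis=1, keys=['before', 'after']
)
print(comparison)

# The ranking of events is identical even though the shares shifted slightly
before_rank = list(comparison['before'].sort_values(ascending=False).index)
after_rank = list(comparison['after'].sort_values(ascending=False).index)
print(before_rank == after_rank)
```

If the ranking changed after filtering, that would hint the removed period had a qualitatively different event mix.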
event_counts = filtered_logs.event_name.value_counts().rename_axis('event_name').reset_index(name='events_num')
fig = px.pie(event_counts, values='events_num', names='event_name', title='Proportions of the events in the filtered dataset')
fig.show()